We review the application of Statistical Mechanics methods to the study of online learning 
of a drifting concept in the limit of large systems. The model where a feed-forward network 
learns from examples generated by a time dependent teacher of the same architecture is 
analyzed. The best possible generalization ability is determined exactly, through the use of a 
variational method. The constructive variational method also suggests a learning algorithm. 
It depends, however, on some unavailable quantities, such as the present performance of the 
student. The construction of estimators for these quantities permits the implementation of 
a very effective, highly adaptive algorithm. Several other algorithms are also studied for 
comparison with the optimal bound and the adaptive algorithm, for different types of time 
evolution of the rule. 



I. INTRODUCTION 



The importance of universal bounds to generalization errors, in the spirit of the Vapnik- 
Chervonenkis (VC) theory, cannot be overstated, since these results are independent of target 
function and input distribution. These bounds are tight in the sense that for a particular 
target an input distribution can be found where generalization is as difficult as the VC 
bound states. However, for several learning problems, by making specific assumptions, it 
is possible to go further. Haussler et al [hll have found tighter bounds that even capture 
functional properties of learning curves, such as for example the occurrence of discontinuous 
jumps in learning curves, which cannot be predicted from VC theory alone. 

These results were derived by adapting to the problem of learning, ideas that arise in 
the context of Statistical Mechanics. In recent years many other results |26|^ , p3t , bounds 
or approximations, rigorous or not, have been obtained in the learning theory of neural 
networks by applying a host of methods originated in the study of disordered materials. 
These methods permit looking at the properties of large networks, where great analytical 
simplifications can occur; and also, they afford the possibility of performing averages over the 
randomness introduced by the training data. They are useful in that they give information 
about typical rather than e.g. worst case behavior and should be regarded as complementary 
to those of Computational Learning Theory. 

The Statistical Mechanics of learning has been formulated either as a problem at 
thermodynamical equilibrium or as a dynamical process off-equilibrium, depending on the type of 
learning strategy. Although many intermediate regimes can be identified, we briefly discuss 
the two dynamical extremes. Batch or offline methods essentially give rise to the equilibrium 
formulation, while online learning can be better described as an off-equilibrium process. 

The offline method begins by constructing a cost or energy function on the parameter 
space, which depends on all the training data simultaneously |5(|[52|. Learning occurs by 
defining a gradient descent process on the parameter space, subject to some (thermal) noise 
process, which permits to some extent escaping from local traps. In a very simplified way 
it may be said that, after some time, this process leads to "Thermal equilibrium", when 
essentially all possible information has been extracted by the algorithm from the learning set. 
The system is now described by a stationary (Gibbs) probability distribution on parameter 
space. 

On the other extreme lies online or incremental learning. Instead of training with a cost 
function defined over all the available examples, the online cost function depends directly on 
only one single example, independently chosen at each time step of the learning dynamics 
jl]], (for a review, |0|). Online learning occurs also by gradient descent, but now the random 
nature of the presentation of the examples implies that at each learning step an effectively 
different cost function is being used. This can lead to good performance even without the 
costly memory resources needed to keep the information about the whole learning set, as 
required in the offline case. 

Although most of the work has concentrated on learning in stationary environments with 
either online or offline strategies, the learning of drifting concepts has also been modeled 
using ideas of Statistical Mechanics ^|^|[T^ . The natural approach to this type of problem is 
to consider online learning, since old examples may not be representative of the present state 
of the concept. It makes little sense, if any, to come to thermal equilibrium with possibly 
already irrelevant old data, as would be the case of an offline strategy. The possibility of 
forgetting old information and of preventing the system from reusing it, which are essential 
features to obtain good performance, are inherent to the online processes we will see below. 

We will model the problem of supervised learning, in the sense of Valiant [^| , of a drifting 
concept by defining a "teacher" neural network. Drift is modeled by allowing the teacher 
network parameters to undergo a drift that can be either random or deterministic. The 
dynamics of learning occurs in discrete time. At each time step, a random input vector 
is chosen independently from a distribution $P_D$, giving rise to a temporal stream of 
input-output pairs, where the output is determined by the teacher. From this set of data the 
student parameters will be built. 

The question addressed in this paper concerns the best possible way in which the informa- 
tion can be used by the student in order to obtain maximum typical generalization ability. 
This is certainly too much to ask for and we will have to make some restrictions to the 
problem. This question will be answered by means of a variational method for the following 
class of exactly soluble models: a feed-forward boolean network learning from a teacher 
which is itself a neural network of similar architecture and learns by a hebbian modulated 
mechanism. 

This is still hard and further restrictions will be made. The thermodynamic limit (TL) 
will always be assumed. This means that the dimension N of parameter space is taken to 
infinity. This increase brings about great analytical simplifications. The TL is the natural 
regime to study in Condensed Matter Physics. There the number of interacting units is of 
the order of $N \approx 10^{23}$ and fluctuations of macroscopic variables are of order $\sqrt{N}$. In studying 
neural networks, results obtained in the TL ought to be considered as the first term in a 
systematic expansion in powers of 1/N. 

Once this question has been answered in a restricted setting, what does it imply for more 
general and realistic problems? The variational method has been applied to several models, 
including boolean and soft transfer functions, single layer perceptrons and networks with 
hidden units, networks with or without overlapping receptive fields and also for the case of 
non-monotonic transfer functions pri|p8[ j7 p^ , p^l|j30[ . In solving the problem in different cases, 
different optimal algorithms have been found. But rather than delving into the differences, 
it is important to stress that a set of features is common to all optimal algorithms. Some 
of these common features are obvious or at least expected and have been incorporated into 
algorithms built in an ad hoc manner. Nevertheless, it is quite interesting to see them 
arise from theoretical arguments rather than heuristically. Moreover, the exact functional 
dependence is also obtained, and this can never be obtained just from heuristics. See J24|] for 
an explicitly Bayesian formulation of online learning which, in the TL seems to be similar 
to the variational method. 

The first important result of the variational program is to give lower bounds to the 
generalization errors. But it gives more: the constructive nature of the method also 
furnishes an 'optimal algorithm'. However, the direct implementation of the optimal algorithm 
is not possible, as it relies on the knowledge of information that is not readily accessible. 
This reliance is not to be thought of as a drawback but rather as indicating what kind 
of information is needed in order to approximate, if not saturate, the optimal bounds. It 
indicates directions for further research where the aim should be on developing efficient 
estimation schemes for those variables. 

The procedure to answer what is the best possible algorithm in the sense of generalization 
is as follows. The generalization error, in the TL, can be written as a function of a set of 
macroscopic parameters, sometimes referred to as 'order parameters', by borrowing the 
nomenclature from Physics. The online dynamics of the weights (microscopic variables) 
induces a dynamics of the order parameters, which in the TL is described by a closed 
set of coupled differential equations. The evolution of the generalization error is thus a 
functional of the cost function gradient which defines the learning algorithm. The gradient 
of the cost function is usually called the modulation function. The local optimization (see 
| pB for global) is done in the following way. Taking the functional derivative of the rate of 
decay of the generalization error, with respect to the modulation function, equal to zero, 
permits determining the modulation function that extremizes the mean decay at each time 
step. This extreme represents, in many of the interesting cases a maximum ( see |3l| for 
exceptions). We can thus determine the modulation function, i.e. the algorithm, that leads 
to the fastest local decrease of the generalization error under several restrictions, to be 
discussed below. 

In this paper several online algorithms are analyzed for the boolean single layer perceptron. 
Other architectures, with e.g. internal layers of hidden units, can be analyzed, although 
there is a need for laborious modifications of the methods. Examples of random drift, 
deterministic evolution, changing drift levels and piecewise constant concepts are presented. 
The paper is organized as follows. In section II, the variational approach is briefly reviewed. 
In section III, analytical results and simulations are presented for several algorithms in the 
cases of random drift and deterministic 'worst-case' drift, where the teacher flees from the 
student, in weight space. The asymptotics of the different algorithms are characterized by 
a pair of exponents: $\beta$, the learning exponent, and $\delta$, the drift or residual exponent. A 
relation between these exponents is obtained. A practical adaptive algorithm is discussed 
in section IV, where it is applied to a problem with changing drift level. In section V, the 
Wisconsin test for perceptrons is studied. Numerical results for the piecewise constant rule 
are presented. Concluding remarks are presented in section VI. 



II. THE VARIATIONAL APPROACH 



The mathematical framework employed in the statistical mechanics of online learning and 
in the variational optimization are quickly reviewed in this section. We consider only the sim- 
ple perceptron with no hidden layer. For extensions to other architectures see |il^j7 p^ , |3l|j3C| ] 


A. Preliminary definitions 



The boolean single layer perceptron is defined by the function $\sigma_B = \mathrm{sign}(\mathbf{B} \cdot \mathbf{S})$, with 
$\mathbf{S} \in \mathbb{R}^N$, parametrized by the concept weight vector $\mathbf{B} \in \mathbb{R}^N$, also called synaptic vector. 

In the student-teacher scenario that we are considering, a perceptron (teacher) generates 
a sequence of statistically independent training pairs $\mathcal{L} = \{(\mathbf{S}^\mu, \sigma_B^\mu) : \mu = 1, \dots, p\}$, and 
another perceptron (student) is constructed, using only the examples in $\mathcal{L}$, in order to infer 
the concept represented by the teacher's vector. The teacher and student are respectively 
defined by weight vectors $\mathbf{B}$ and $\mathbf{J}$ with norms denoted by $B$ and $J$. 

In the presence of noise, instead of $\sigma_B$, the student has access only to a corrupted version 
$\tilde\sigma_B$. For example, for multiplicative noise, each teacher's output is flipped independently 
with probability $\chi$ |||,@: 

$$ P(\tilde\sigma_B | \sigma_B) = (1 - \chi)\,\delta(\sigma_B, \tilde\sigma_B) + \chi\,\delta(\sigma_B, -\tilde\sigma_B) , \tag{1} $$

where $\sigma_B = \mathrm{sign}(y)$, and $y = \mathbf{B} \cdot \mathbf{S}/B$ is the normalized field. The Kronecker $\delta$ is 1 (0) 
if the arguments are equal (different). In the same way, for the student, the field $x = \mathbf{J} \cdot \mathbf{S}/J$ 
and the output $\sigma_J = \mathrm{sign}(x)$ are defined. 
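
The definitions above can be illustrated with a minimal sketch in Python. It assumes Gaussian input components (used here as a stand-in for inputs on the hyper-sphere of radius $\sqrt{N}$) and uses hypothetical helper names (`teacher_output`, `corrupt`, `flip_fraction`) that are not part of the original text. Each label is flipped independently with probability `chi`, following the verbal description of the noise model.

```python
import math
import random


def teacher_output(B, S):
    """Noiseless teacher label sigma_B = sign(B . S)."""
    field = sum(b * s for b, s in zip(B, S))
    return 1 if field >= 0 else -1


def corrupt(sigma, chi, rng):
    """Return the label flipped with probability chi, unchanged otherwise."""
    return -sigma if rng.random() < chi else sigma


def flip_fraction(N=20, samples=10000, chi=0.1, seed=0):
    """Empirical fraction of corrupted labels for a random teacher."""
    rng = random.Random(seed)
    B = [rng.gauss(0.0, 1.0) for _ in range(N)]
    flips = 0
    for _ in range(samples):
        # Assumption: i.i.d. Gaussian components instead of the uniform sphere.
        S = [rng.gauss(0.0, 1.0) for _ in range(N)]
        sigma = teacher_output(B, S)
        if corrupt(sigma, chi, rng) != sigma:
            flips += 1
    return flips / samples
```

Calling `flip_fraction(chi=0.1)` should return a value close to 0.1, up to sampling fluctuations.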

The definition of a global cost function $E_{\mathcal{L}}(\mathbf{J}) = \sum_\mu E^\mu(\mathbf{J})$, over the entire data set $\mathcal{L}$, 
is required for batch or offline learning. The interaction among the partial potentials $E^\mu(\mathbf{J})$ 
may generate spurious local minima, leading to metastable states and possibly very long 
thermalization times. This can be avoided to a great extent by learning online. 

We define a discrete dynamics where at each time step a weight update is performed 
along the gradient of a partial potential $E^\mu(\mathbf{J})$, which depends on the randomly chosen $\mu$-th 
example. This random sampling of partial potentials introduces fluctuations which tend 
to decrease as the system approaches a (hopefully) global minimum. That process has 
recently been called self-annealing Jl^], as opposed to simulated annealing, which depends on 
an external parameter. The general conditions for convergence of online learning to a global 
minimum, even in stationary environments, are an open problem of major current interest. 

The online dynamics can be represented by a finite difference equation for the update 
of weight vectors. For each new random example, make a small correction of the current 
student, in the direction opposite to the gradient of the partial potential, and also allow for 
a restriction of the overall length of the weight vector to prevent runaway behavior: 

$$ \mathbf{J}^{\mu+1} = \mathbf{J}^{\mu} - \Delta t\, \kappa^{\mu} \mathbf{J}^{\mu} - \Delta t\, \nabla_{\mathbf{J}} E^{\mu} . \tag{2} $$

Here the partial potential $E^\mu$ is a function of the scalars that are accessible to the student 
(field $x$, norm $J$ and output $\tilde\sigma_B$) and corresponds to the randomly sampled example pair 
$(\mathbf{S}^\mu, \tilde\sigma_B^\mu)$. The time scale $\Delta t$ must be $\Delta t \sim O(1/N)$ so that in the TL we can derive 
well-behaved differential equations; in general we will choose $\Delta t = 1/N$. The second term allows 
control of the norm of the weight vector $\mathbf{J}$. 

It is simple to see that $\nabla_{\mathbf{J}} E^\mu = (\partial E^\mu/\partial x)\, \nabla_{\mathbf{J}} x$. The calculation of the gradient and the 
definition $\partial E^\mu/\partial x = -W^\mu(V)\,\tilde\sigma_B$ finally lead to the online dynamics in the form: 



Note that each example pair $(\mathbf{S}^\mu, \tilde\sigma_B^\mu)$ is used only once to update the student's synapses; 
$\tilde\sigma_B \mathbf{S}^\mu$ is called the Hebbian term and the intensity of each modification is given by the 
modulation function $W$. The prefactor $J\tilde\sigma_B$ can be absorbed into the modulation function, 
but has been explicitly written for convenience. The single most important fact is that 
the relevant change is made along the direction of the input vector $\mathbf{S}^\mu$. The reasons for 
this restriction are the following: the optimal algorithms to be discussed below are bounds 
only in the TL. In the absence of site correlations, i.e. $\langle S_i S_j \rangle = 0$ for $i \neq j$, the prefactor of $\mathbf{S}^\mu$ is a 
diagonal matrix [[mJ . Furthermore, the class of modulated Hebbian algorithms is interesting 
even for finite $N$ from a biological perspective. The symbol $V$ denotes the learning situation, 
that is, the set of quantities that we are allowed to use in the modulation function: 
the available information. For boolean perceptrons, $V$ may contain the corrupted teacher's 
output $\tilde\sigma_B$, the field $x$ and, as discussed below, some information about the generalization 
error. We can also study the restrictions $V = \{\tilde\sigma_B\}$, $V = \{\tilde\sigma_B, \sigma_J\}$ and $V = \{\tilde\sigma_B, |x|\}$. 
Evidently, the more information the student has, the better we expect it to learn. 

We consider a specific model for concept drift introduced by Biehl and Schwarze. The 
drift scale that can be followed by an online learning system cannot be too large, for it 
would be impossible to track, but if too slow it trivially reduces to an effectively stationary 
problem. Their choice, which makes the problem interesting, is as follows. At each time 
step the concept vector $\mathbf{B}$ suffers the influence of the changing environment and evolves as 

$$ \mathbf{B}^{\mu+1} = \left(1 - \frac{\Lambda}{N}\right) \mathbf{B}^{\mu} + \frac{\boldsymbol\eta}{N} , \tag{4} $$

where $\Lambda$ controls the norm $B$ and $\boldsymbol\eta \in \mathbb{R}^N$ is the drift vector. Random and deterministic 
versions of $\boldsymbol\eta$ will be considered in Section III. 

The performance of a specific student $\mathbf{J}$ on a given concept $\mathbf{B}$ can be measured by the 
generalization error $e_G$, defined as the average of the instantaneous error $\epsilon = \frac{1}{2}(1 - \sigma_J \sigma_B)$ 
($\sigma_B$ is the non-corrupted output) over inputs extracted from the uniform distribution $P_U(\mathbf{S})$ 
with support over the hyper-sphere of radius $\sqrt{N}$: 

$$ e_G(\mathbf{J}, \mathbf{B}) = \int d\mathbf{S}\, P_U(\mathbf{S})\, \epsilon(\sigma_J(\mathbf{S}), \sigma_B(\mathbf{S})) . \tag{5} $$

We make explicit the difference between $e_G$ and the prediction error $e_P$, which measures 
the average $e_P = \langle \frac{1}{2}(1 - \sigma_J \tilde\sigma_B) \rangle$ over the true distribution of examples $P_D$. It is not difficult 
to see that the expression for $e_G$ is invariant under rotations of axes in $\mathbb{R}^N$; therefore the 
integral (5) depends only on the invariants $\rho = \mathbf{B} \cdot \mathbf{J}/BJ$, $x$, $y$, $B$ and $J$. In the TL a 
straightforward application of the Central Limit Theorem leads to: 

$$ e_G(\rho) = \int dx\, dy\, P_C(x, y)\, \Theta(-xy) = \frac{1}{\pi} \arccos\rho . \tag{6} $$

$P_C(x, y)$ is a Gaussian distribution in $\mathbb{R}^2$ with correlation matrix 

$$ C = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} . $$

Note that $\rho$ is a parameter in the probability distribution describing the fields, and $J$ 
and $B$ define the scale of the same fields. In statistical physics, such variables are called 
macroscopic variables. 




FIG. 1. Simple representation of weight vectors in the hyper-sphere. The teacher and the student disagree when 
the input vector S is inside the shaded region. 



The intuitive meanings of $\rho$ and of Eq. (6) can be verified with the help of Fig. 1. 
Observe that $\rho = \cos\theta$ and that, for the boolean perceptron, the weight vectors are normal 
to a hyper-plane that divides the hyper-sphere into two differently labelled hemispheres. It is 
then easy to see that the student and the teacher disagree on the labeling of input vectors $\mathbf{S}$ 
inside the shaded region, thus trivially $e_G = \theta/\pi = \frac{1}{\pi} \arccos\rho$. 
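
The geometric argument can be checked numerically. The following is a minimal Monte Carlo sketch, assuming Gaussian inputs (which are rotationally invariant, like the uniform distribution on the sphere) and a hypothetical helper `generalization_error_mc` introduced only for this illustration. It builds a student with a prescribed overlap $\rho$ with the teacher and counts how often the two labels differ.

```python
import math
import random


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [a / norm for a in v]


def generalization_error_mc(rho, N=10, samples=40000, seed=1):
    """Fraction of random inputs on which teacher and student disagree."""
    rng = random.Random(seed)
    B = _normalize([rng.gauss(0.0, 1.0) for _ in range(N)])
    # Unit vector orthogonal to B (one Gram-Schmidt step).
    u = [rng.gauss(0.0, 1.0) for _ in range(N)]
    proj = _dot(u, B)
    u = _normalize([a - proj * b for a, b in zip(u, B)])
    # Student with overlap rho: J = rho * B + sqrt(1 - rho^2) * u.
    J = [rho * b + math.sqrt(1.0 - rho * rho) * a for b, a in zip(B, u)]
    errors = 0
    for _ in range(samples):
        S = [rng.gauss(0.0, 1.0) for _ in range(N)]
        if _dot(B, S) * _dot(J, S) < 0:
            errors += 1
    return errors / samples


def generalization_error_exact(rho):
    """Closed form e_G = arccos(rho) / pi."""
    return math.acos(rho) / math.pi
```

For $\rho = 0.9$ both functions should give a value close to 0.14, with the Monte Carlo estimate fluctuating at the level of a few parts in a thousand.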



B. Emergence of the macroscopic dynamics 



The dimensionality of the dynamical system involved can be reduced by using (3) and (4) 
to write a system of coupled difference equations for the macroscopic variables: 

[Equations (7)-(9): difference equations for $\rho^{\mu+1}$, $J^{\mu+1}$ and $B^{\mu+1}$.] 


In the above equations the local stability $\Delta = x\tilde\sigma_B$ was introduced. Positive stability 
means that the student classification $\sigma_J = \mathrm{sign}(x)$ agrees with the (noisy) learning data $\tilde\sigma_B$. 

The usefulness of the TL lies in the possibility of transforming the stochastic difference 
equations into a closed set of deterministic differential equations Jl^,^5). The idea is to 
choose a continuous time scale $\alpha$ such that in the TL regime $p/N \to \alpha$, where $p$ is the 
number of examples already presented. The equations are then averaged over the distributions 
of the input vectors $\mathbf{S}$ and drift vector $\boldsymbol\eta$, leading to: 
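
[Equations (10)-(12): averaged differential equations for the macroscopic variables.] 

where $\langle \cdots \rangle = \int d\boldsymbol\eta\, d\mathbf{S}\, (\cdots) P(\boldsymbol\eta, \mathbf{S})$ and the definitions $C_{S\eta} = \lim_{N \to \infty} (\mathbf{S} \cdot \boldsymbol\eta)/(NB)$ and 
$C_{\eta\eta} = \lim_{N \to \infty} (\boldsymbol\eta \cdot \boldsymbol\eta)/(2BN)$ have been used. 
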

The fluctuations in the stochastic equations vanish in the TL and the above equations 
become exact (self-averaging property). This can be proved by writing the Fokker-Planck 
equations for the finite $N$ stochastic process defined in (7), (8) and (9), and showing that 
the diffusive term vanishes in the TL |22(]. 



C. Variational optimization of algorithms 



The variational approach was proposed in [[lfj as an analytical method to find learning 
algorithms with optimal mean decay (per example) of the generalization error. The same 
method was applied in several architectures and learning situations. 



The idea is to write: 

$$ \frac{de_G}{d\alpha} = \frac{\partial e_G}{\partial \rho} \frac{d\rho}{d\alpha} , \tag{13} $$



and use equation (10) to build the functional: 

$$ \frac{de_G}{d\alpha}[W] = \frac{\partial e_G}{\partial \rho}\, \frac{d\rho}{d\alpha}[W] . \tag{14} $$



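Thus, the optimization is attained by imposing the extremum condition: 

$$ \frac{\delta}{\delta W(V)} \left( \frac{de_G}{d\alpha} \right) \bigg|_{W^*} = 0 , \tag{15} $$

where $\delta/\delta W(V)$ stands for the functional derivative in the subspace of modulation functions 
$W$ with dependence on the set $V$. The above equation can be solved observing that $\partial e_G/\partial\rho \neq 0$ 
and 

$$ \frac{\delta}{\delta W(V)} \langle f(H, V)\, W^n(V) \rangle = n\, \langle f(H, V) \rangle_{H|V}\, W^{n-1}(V) . \tag{16} $$
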



Here $f$ is an arbitrary function and $H$ is the set of hidden variables, that is, in contrast with the 
set $V$, the variables not accessible to the student (e.g. field $y$, drift vector $\boldsymbol\eta$, etc.), and 
$\langle \cdots \rangle_{H|V} = \int dH\, P(H|V) \cdots$, where for a given set $H = \{a_1, a_2, \dots\}$, $dH = da_1\, da_2 \cdots$. The 
solution is given by: 

$$ W^*(V) = \frac{1}{\rho} \langle y + C_{S\eta} \rangle_{H|V}\, \tilde\sigma_B - \Delta . \tag{17} $$

By writing $y + C_{S\eta} = (\mathbf{B} + \boldsymbol\eta) \cdot \mathbf{S}/B$ it can be seen that the optimal algorithm "tries" to 
pull the example stability $\Delta$, not to an estimate of the present teacher stability, but to 
one already corrected by the effect of the drift $\boldsymbol\eta$. It seems natural to concentrate on cases 
where the drift and the input vectors are uncorrelated ($C_{S\eta} = 0$). 

The optimization under different conditions of information availability, i.e., different 
specifications of the sets $V$ and $H$, leads to different learning algorithms. This can be seen by 
performing the appropriate averages in (17), as we proceed to show: 



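Annealed Hebb Algorithm: Suppose that the available information is limited to the corrupted teacher output. 
This corresponds to the learning situation such that $H = \{y, \boldsymbol\eta, |x|, \sigma_J\}$ and $V = \{\tilde\sigma_B\}$. Considering that 
$C_{S\eta} = 0$, we need to perform the average: 

$$ W^*(\tilde\sigma_B) = \frac{1}{\rho} \langle y - \rho x \rangle_{\{x, H\}|\tilde\sigma_B}\, \tilde\sigma_B , \tag{18} $$

which involves $\int dy\, y\, P(y|\tilde\sigma_B)$ and $\int dx\, x\, P(x|\tilde\sigma_B)$. The probability distributions are easily obtained using 
Bayes' theorem: 

$$ P(y|\tilde\sigma_B) = \frac{P(\tilde\sigma_B|y)\, P(y)}{\int dy\, P(\tilde\sigma_B|y)\, P(y)} . \tag{19} $$

By the Central Limit Theorem we know that, in the TL, $P(y)$ and $P(x)$ are Gaussians with unit variance and 
it is not difficult to verify, using (1), that: 

$$ P(\tilde\sigma_B|y) = \frac{\chi}{2} + (1 - \chi)\, \Theta(\tilde\sigma_B y); \qquad P(\tilde\sigma_B|x) = \frac{\chi}{2} + (1 - \chi)\, H\!\left(-\frac{\tilde\sigma_B x}{\lambda}\right) , \tag{20} $$

where $\lambda = \sqrt{1 - \rho^2}/\rho$ and $H(u) = \int_u^\infty dz\, \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$. 

It follows that 

$$ \langle y \rangle_{\{x, H\}|\tilde\sigma_B} = \sqrt{\frac{2}{\pi}}\, \tilde\sigma_B (1 - \chi); \qquad \langle x \rangle_{\{y, H\}|\tilde\sigma_B} = \sqrt{\frac{2}{\pi}}\, \tilde\sigma_B (1 - \chi)\, \rho . \tag{21} $$

Combining the above results in (18) finally gives: 

$$ W_{AH}(\tilde\sigma_B; \rho, \chi) = \sqrt{\frac{2}{\pi}}\, \lambda^2 \rho\, (1 - \chi) . \tag{22} $$
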

The weight changes are proportional to the Hebb factor, but the modulation function does not depend on the 
example stability $\Delta$ (see Fig. 2). Hence the name Hebb. However, this function is not constant in time: the 
temporal evolution (annealing) is automatically incorporated into the modulation function. Optimal annealing 
is achieved by having the modulation function depend on the parameter $\rho$, the normalized overlap between the 
student $\mathbf{J}$ and the concept $\mathbf{B}$. Since this quantity is certainly not available to the student, there will be a need 
to complement the learning algorithm with an efficient estimator of the present level of performance of the 
student (see Sec. IV). 




FIG. 2. Modulation functions exemplified for $\rho = 0.9$ with noise levels $\chi = 0$ (top) and $\chi = 0.1$ (bottom): Annealed 
Hebb (dots), Step (dashes), Symmetric (dashes-dots) and Optimal (solid). 

Step Algorithm ( Q): This algorithm, obtained under the restriction $H = \{y, \boldsymbol\eta, |x|\}$ and $V = \{\tilde\sigma_B, \sigma_J\}$, 
is a close relative of Rosenblatt's original perceptron algorithm, which works by error 
correction and treats all errors in the same manner. There are two important differences, however: 
correct answers also cause (smaller) corrections and, furthermore, the size of the corrections evolves in time in 
a manner similar to the annealed Hebb algorithm. 
The modulation function is 

$$ W_{Step}(\tilde\sigma_B, \sigma_J; \rho, \chi) = \frac{1}{\sqrt{2\pi}}\, \lambda^2 \rho\, (1 - \chi) \left[ \frac{\chi}{2} + \frac{1 - \chi}{\pi} \arccos(-\rho\, \tilde\sigma_B \sigma_J) \right]^{-1} . \tag{23} $$

Note that the Step Algorithm has access to the student's output and can differentiate between right ($\Delta > 0$) 
and wrong ($\Delta < 0$) classifications; the name arises from the form of its modulation function (Fig. 2). The 
annealing increases the height of the step, i.e. the difference between right and wrong, as the overlap $\rho$ goes to 
one. 



Symmetric Weight Algorithm (see [15]): This is the optimal algorithm for the learning situation described by 
$H = \{y, \boldsymbol\eta, \sigma_J\}$ and $V = \{\tilde\sigma_B, |x|\}$. The resulting modulation function is given by: 

$$ W_{SW}(\tilde\sigma_B, |x|; \rho, \chi) = \sqrt{\frac{2}{\pi}}\, \lambda (1 - \chi)\, e^{-x^2/2\lambda^2} . \tag{24} $$

This algorithm cannot discern between wrong and right classifications, but only differentiates between "easy" 
(large $|\Delta|$) and "hard" (small $|\Delta|$) classifications, concentrating the learning on "hard" examples (Fig. 2). 

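Optimal Algorithm: When all the available information is used we have the learning situation 
described by $H = \{y, \boldsymbol\eta\}$ and $V = \{\tilde\sigma_B, |x|, \sigma_J\}$. The optimal algorithm is then given by: 

$$ W_{OPT}(\Delta = \tilde\sigma_B x; \rho, \chi) = \frac{1}{\sqrt{2\pi}}\, \frac{\lambda (1 - \chi)\, e^{-\Delta^2/2\lambda^2}}{\chi/2 + (1 - \chi)\, H(-\Delta/\lambda)} . \tag{25} $$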



In the presence of noise, a crossover is built into the Optimal modulation function. This crossover is from a 
regime where the student classification is not strongly defined (A negative but small) - and the information 
from the teacher is taken seriously - to a regime where the student is confident on its own answer - and 
any strong disagreement (very negative A) with the teacher will be attributed to noise, and thus effectively 
disregarded. The scale of the stabilities where the crossover occurs depends on the level of performance p and 
therefore is also annealed. 
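
The four modulation functions can be written down compactly in code. The sketch below assumes the forms $W_{AH} = \sqrt{2/\pi}\,\lambda^2\rho(1-\chi)$, $W_{Step} = \lambda^2\rho(1-\chi) / \{\sqrt{2\pi}\,[\chi/2 + (1-\chi)\arccos(-\rho\tilde\sigma_B\sigma_J)/\pi]\}$, $W_{SW} = \sqrt{2/\pi}\,\lambda(1-\chi)e^{-x^2/2\lambda^2}$ and $W_{OPT} = \lambda(1-\chi)e^{-\Delta^2/2\lambda^2} / \{\sqrt{2\pi}\,[\chi/2 + (1-\chi)H(-\Delta/\lambda)]\}$, with $\lambda = \sqrt{1-\rho^2}/\rho$. The function names are hypothetical and chosen only for this illustration.

```python
import math


def gaussian_tail(u):
    """H(u): probability that a standard normal variable exceeds u."""
    return 0.5 * math.erfc(u / math.sqrt(2.0))


def lam(rho):
    """lambda = sqrt(1 - rho^2) / rho, the annealing scale."""
    return math.sqrt(1.0 - rho * rho) / rho


def w_annealed_hebb(rho, chi=0.0):
    return math.sqrt(2.0 / math.pi) * lam(rho) ** 2 * rho * (1.0 - chi)


def w_step(sigma_b, sigma_j, rho, chi=0.0):
    denom = chi / 2.0 + (1.0 - chi) * math.acos(-rho * sigma_b * sigma_j) / math.pi
    return lam(rho) ** 2 * rho * (1.0 - chi) / (math.sqrt(2.0 * math.pi) * denom)


def w_symmetric(x, rho, chi=0.0):
    l = lam(rho)
    return math.sqrt(2.0 / math.pi) * l * (1.0 - chi) * math.exp(-x * x / (2.0 * l * l))


def w_optimal(delta, rho, chi=0.0):
    l = lam(rho)
    denom = chi / 2.0 + (1.0 - chi) * gaussian_tail(-delta / l)
    gauss = math.exp(-delta * delta / (2.0 * l * l))
    return l * (1.0 - chi) * gauss / (math.sqrt(2.0 * math.pi) * denom)
```

A useful consistency check on these forms: averaging `w_optimal` over the two signs of $\Delta = \pm|x|$, each weighted by $\chi/2 + (1-\chi)H(-\Delta/\lambda)$, reproduces `w_symmetric`, as expected when the student's output is moved from the available set $V$ to the hidden set $H$.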



The learning mechanisms are highly adaptive and remain the same in the case of drifting 
rules, where the common features described above, mainly the p dependent annealing, lead 
automatically to a forgetting mechanism without the need to impose it, based on heuristic 
expectations, in an ad hoc manner. 

It is interesting to note that the heuristically proposed algorithms are approximations of 
these optimized modulation functions. For instance, the simple Hebb rule is the Annealed Hebb 
when $\boldsymbol\eta = 0$, since it can be shown in this case that $WJ = 1$, and corresponds to the $\rho \to 0$ 
regime of all the optimized algorithms; Rosenblatt's Perceptron algorithm is qualitatively 
similar to the Step algorithm; Adatron approximates the Optimal algorithm for $\chi = 0$ 
and $\rho \to 1$; OLGA Jl4| and Thermal perceptron jioj algorithms resemble the Optimal 
modulation with $\chi > 0$. 



III. LEARNING DRIFTING CONCEPTS 



The important result from last section is that, under the assumption of uncorrelated drift 
and input vectors, the modulation functions do not depend on the drift parameters (in 
contrast with the explicit dependence on the examples noise level $\chi$). So, they are expected 
to be robust to continuous or abrupt, random or deterministic, concept changes. In this 
section simple instances of drifting concepts are examined; abrupt changes are studied in 
Sec. V. 


A. Random drift 



In this scenario, the concept weight vector $\mathbf{B}$ performs a random walk on the surface of an 
$N$-dimensional unit sphere. The drift vector has random components with zero mean and 
variance $2D$, 

$$ \langle \eta_i \eta_j \rangle = 2D\, \delta_{ij} . \tag{26} $$

The condition $\mathbf{B}^{\mu+1} \cdot \mathbf{B}^{\mu+1} = 1$ is imposed in (4) by considering that $\mathbf{B}^0 \cdot \mathbf{B}^0 = 1$ and with 
the choice $\Lambda = \boldsymbol\eta \cdot \mathbf{B} + D$. 
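
A short simulation makes the role of this choice concrete. The sketch below assumes the drift update $\mathbf{B}^{\mu+1} = (1 - \Lambda/N)\mathbf{B}^\mu + \boldsymbol\eta/N$ with $\Lambda = \boldsymbol\eta \cdot \mathbf{B} + D$ and independent Gaussian components of variance $2D$. The function names `drift_step` and `run_drift` are hypothetical. It tracks the norm of the concept vector and its overlap with the initial concept after $\alpha N$ steps.

```python
import math
import random


def drift_step(B, D, rng):
    """One random-drift update of the concept vector B (list of floats)."""
    N = len(B)
    eta = [rng.gauss(0.0, math.sqrt(2.0 * D)) for _ in range(N)]
    # Lambda = eta . B + D keeps |B| = 1 up to higher-order corrections in 1/N.
    big_lambda = sum(e * b for e, b in zip(eta, B)) + D
    return [(1.0 - big_lambda / N) * b + e / N for b, e in zip(B, eta)]


def run_drift(N=200, D=0.1, alpha=2.0, seed=0):
    """Return (final norm, overlap with the initial concept) after alpha*N steps."""
    rng = random.Random(seed)
    start = [rng.gauss(0.0, 1.0) for _ in range(N)]
    norm0 = math.sqrt(sum(b * b for b in start))
    start = [b / norm0 for b in start]
    B = list(start)
    for _ in range(int(alpha * N)):
        B = drift_step(B, D, rng)
    norm = math.sqrt(sum(b * b for b in B))
    overlap = sum(a * b for a, b in zip(start, B))
    return norm, overlap
```

With these settings the norm should stay close to one, and the overlap with the initial concept should be comparable to $e^{-D\alpha}$, up to finite-size fluctuations of order $1/\sqrt{N}$.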






FIG. 3. Integration of learning equations and simulation results ($N = 5000$) for random drift $D = 0.1$: Annealed 
Hebb (triangles), Symmetric (white circles), Step (black circles) and Optimal (white squares). The self-averaging 
property is clear since the simulation results refer to only one run. 

FIG. 4. Asymptotic error $e_G^\infty(D)$ for random drift: Annealed Hebb (triangles), Symmetric (white circles), Step 
(black circles) and Optimal (white squares). 



The order of magnitude of the scaling of the drift vector with $N$ is important since it gives 
the correct time scale for non-trivial behavior: if smaller, the drift would be irrelevant on the 
time scale of learning, while if larger it would not allow any tracking. In the relevant regime, 
the task is non-trivial also in another sense: the autocorrelation of the concept vector decays 
exponentially in the $\alpha$-scale, $\langle \mathbf{B}(\alpha) \cdot \mathbf{B}(\alpha') \rangle \propto e^{-D|\alpha - \alpha'|}$. 

For the optimized algorithms, the equation for $\rho$ decouples from the equation for $J$. After 
the proper averages, the learning equation for this type of drift reduces to 

$$ \frac{d\rho}{d\alpha} = \rho \left\langle W \left( \frac{\tilde\sigma_B y}{\rho} - \Delta \right) \right\rangle - \frac{\rho}{2} \langle W^2 \rangle - \rho D . \tag{27} $$




This equation can be solved for each particular modulation function $W$. The generalization 
errors $e_G = \frac{1}{\pi} \arccos(\rho)$ for the several algorithms described in the last section are 
compared in Fig. 3 for the noiseless case. Solid curves refer to integrations of the above 
learning equation and symbols correspond to simulation results. Although the rule is 
continuously changing, it can be tracked within a stationary error $e_G^\infty$ which depends on the 
drift amplitude $D$. The functions $e_G^\infty(D)$ for the various algorithms can be found from the 
condition $d\rho/d\alpha = 0$ and are shown in Fig. 4. The behavior for small drift is shown in Table 
I. Note the abrupt change in the exponents due to the inclusion of more information than 
the output $\tilde\sigma_B$. 

TABLE I. Small drift exponents: Random case 

Algorithm        $e_G^\infty(D)$               $e_G(D = 0, \chi = 0)$ 
Annealed Hebb    $\approx 0.42\, D^{1/4}$      $0.40\, \alpha^{-1/2}$ 
Symmetric        $\approx 0.52\, D^{1/3}$      $1.41\, \alpha^{-1}$ 
Step             $\approx 0.51\, D^{1/3}$      $1.27\, \alpha^{-1}$ 
Optimal          $\approx 0.45\, D^{1/3}$      $0.88\, \alpha^{-1}$ 
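
As an illustration of how such curves and stationary errors can be obtained, the following sketch integrates the learning equation for the noiseless Annealed Hebb modulation. It assumes $W = \sqrt{2/\pi}\,(1-\rho^2)/\rho$, for which the random-drift learning equation reduces to $d\rho/d\alpha = (1-\rho^2)^2/(\pi\rho) - \rho D$. That reduced form, the simple Euler scheme and the function names are choices made for this example only.

```python
import math


def drho_dalpha(rho, D):
    """Right-hand side for noiseless Annealed Hebb with random drift D."""
    return (1.0 - rho * rho) ** 2 / (math.pi * rho) - rho * D


def integrate_rho(D, rho0=0.05, alpha_max=50.0, dt=1e-3):
    """Euler integration of the overlap rho from alpha = 0 to alpha_max."""
    rho = rho0
    for _ in range(int(alpha_max / dt)):
        rho += dt * drho_dalpha(rho, D)
    return rho


def generalization_error(D, alpha_max=50.0):
    """e_G = arccos(rho) / pi at the end of the integration."""
    return math.acos(integrate_rho(D, alpha_max=alpha_max)) / math.pi
```

For a small drift such as $D = 0.01$ the late-time error should sit close to the small-drift estimate $0.42\,D^{1/4}$ of Table I, and for $D = 0$ the error at large $\alpha$ should approach $0.40\,\alpha^{-1/2}$.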



B. Deterministic drift 



In the learning scenario considered so far, the worst-case drift will occur when at each time 
step the concept is changed deterministically so that the overlap with the current student 
vector is minimized. In this situation, previously examined by Biehl and Schwarze, the 
new concept is chosen by minimizing $\mathbf{B}^{\mu+1} \cdot \mathbf{J}^{\mu}$ subject to the conditions 

$$ \mathbf{B}^{\mu+1} \cdot \mathbf{B}^{\mu+1} = 1, \qquad \mathbf{B}^{\mu+1} \cdot \mathbf{B}^{\mu} = 1 - \frac{D}{N^2} , \tag{28} $$

where now $D$ is the drift amplitude for the deterministic case. Note the different scaling 
with $N$ for non-trivial behavior. 



TABLE II. Small drift exponents: Deterministic case 

Algorithm        $e_G^\infty(D)$               $e_G(D = 0, \chi = 0)$ 
Annealed Hebb                                  $0.40\, \alpha^{-1/2}$ 
Symmetric        $\approx 0.80\, D^{1/4}$      $1.41\, \alpha^{-1}$ 
Step             $\approx 0.76\, D^{1/4}$      $1.27\, \alpha^{-1}$ 
Optimal          $\approx 0.63\, D^{1/4}$      $0.88\, \alpha^{-1}$ 

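Clearly $\mathbf{B}^{\mu+1}$ lies in the same plane which contains $\mathbf{B}^{\mu}$ and $\mathbf{J}^{\mu}$. The solution of this 
constrained minimization problem is 

$$ \mathbf{B}^{\mu+1} = a \mathbf{B}^{\mu} - b \mathbf{J}^{\mu}, \qquad a = 1 - \frac{D}{N^2} + b J \rho, \qquad b = \frac{1}{JN} \left[ \frac{2D - (D/N)^2}{1 - \rho^2} \right]^{1/2} . \tag{29} $$

In terms of $\Lambda$ and $\boldsymbol\eta$ of Eq. (4), this is 

$$ \Lambda = (1 - a) N, \qquad \boldsymbol\eta = -b N \mathbf{J} . \tag{30} $$

The learning equation reduces to 

$$ \frac{d\rho}{d\alpha} = \rho \left\langle W \left( \frac{\tilde\sigma_B y}{\rho} - \Delta \right) \right\rangle - \frac{\rho}{2} \langle W^2 \rangle - \sqrt{2D(1 - \rho^2)} , \tag{31} $$

and the analysis is similar to the previous subsection. Theoretical error curves confirmed by 
simulations are shown in Fig. 5 for the different algorithms with fixed drift amplitude. The 
stationary tracking error $e_G^\infty(D)$ is shown in Fig. 6. 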
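
The constrained minimization can be verified numerically in the plane spanned by teacher and student. The sketch below assumes the solution $\mathbf{B}^{\mu+1} = a\mathbf{B}^\mu - b\mathbf{J}^\mu$ with $a = 1 - D/N^2 + bJ\rho$ and $b = \frac{1}{JN}[(2D - (D/N)^2)/(1-\rho^2)]^{1/2}$. The two-dimensional parametrization and the function names are introduced here only for the check.

```python
import math


def worst_case_coefficients(rho, J, D, N):
    """Coefficients (a, b) of the worst-case drift step B' = a*B - b*J."""
    b = math.sqrt((2.0 * D - (D / N) ** 2) / (1.0 - rho * rho)) / (J * N)
    a = 1.0 - D / N ** 2 + b * J * rho
    return a, b


def apply_drift_2d(rho, J, D, N):
    """Apply one worst-case step in the plane spanned by B and J.

    Returns (|B'|^2, B'.B, new overlap rho').
    """
    B = (1.0, 0.0)                                      # unit teacher vector
    Jv = (J * rho, J * math.sqrt(1.0 - rho * rho))      # student with overlap rho
    a, b = worst_case_coefficients(rho, J, D, N)
    Bn = (a * B[0] - b * Jv[0], a * B[1] - b * Jv[1])
    norm2 = Bn[0] ** 2 + Bn[1] ** 2
    step_overlap = Bn[0] * B[0] + Bn[1] * B[1]
    new_rho = (Bn[0] * Jv[0] + Bn[1] * Jv[1]) / J
    return norm2, step_overlap, new_rho
```

For example, `apply_drift_2d(0.9, 2.0, 0.01, 100)` should return a squared norm of one and a step overlap of $1 - D/N^2$ up to rounding, and an overlap reduced by approximately $\sqrt{2D(1-\rho^2)}/N$, the per-step decrease that appears as the drift term in the learning equation.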






FIG. 5. Integration of learning equations and simulation results (N = 5000) for deterministic drift D = 0.01: 
Annealed Hebb (triangles), Symmetric (white circles), Step (black circles) and Optimal (white squares). 



[Figure: drift amplitude D on the horizontal axis, from 0.00 to 0.05.]

FIG. 6. Asymptotic error $e_G^{\infty}(D)$ for deterministic drift: Annealed Hebb (triangles), Symmetric (white circles),
Step (black circles) and Optimal (white squares).



The behavior for small drift D is shown in Table II. Again we can note the occurrence
of an abrupt change in the exponents after the inclusion of information on the student's
fields.



C. Asymptotic behavior: Critical exponents and universality classes for unlearnable problems 



A simple, although partial, measure of the performance of a learning algorithm can be 
given by the asymptotic decay of the generalization error in the driftless case and alter- 
natively by the residual error dependence on D in the presence of drift. These are not 
independent aspects, but rather linked in a manner reminiscent of the relations between the 
different exponents that describe power law behavior in Critical Phenomena. 

In the absence of concept drift, the generalization error decays to zero when $\alpha$ approaches
$\alpha_c$ (which here happens to be infinity but may have a finite value in other situations 0|)
as a power law of the number of examples, with the so called learning or static exponent $\beta$,

$$e_G \propto \tau^{\beta}, \qquad (32)$$

where $\tau$ is a reduced control parameter that vanishes at $\alpha = \alpha_c$. Thus, we may think of
$e_G$ as a kind of order parameter in the sense that it is a quantity which changes from zero
to a finite value at $\tau = 0$. We may think of $1/\alpha$ as the analogue of the control
parameter T (temperature) in critical phenomena.

Any amount of drift in the concept changes the problem from a learnable to an unlearnable
one, with a residual error $e_G^{\infty} = e_G(\tau = 0)$. We have seen that this behavior at the
critical point $\tau = 0$ also obeys a power law

$$e_G^{\infty} \propto D^{1/\delta}, \qquad (33)$$

where $\delta$ has been called the drift exponent.
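In practice a drift exponent can be read off as the inverse slope of the residual error against D on a log-log scale. The sketch below illustrates this with synthetic data only: the points are generated from the form 0.80 D^(1/4) quoted for the Symmetric algorithm in Table II, not taken from simulations, and the helper name is hypothetical.

```python
import math


def fit_power_law_exponent(xs, ys):
    """Least-squares slope of log(y) versus log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    sxx = sum((a - mean_x) ** 2 for a in lx)
    return sxy / sxx


# Synthetic residual errors (assumption: exact power law, no noise).
drifts = [0.001, 0.002, 0.005, 0.01, 0.02]
errors = [0.80 * D ** 0.25 for D in drifts]

slope = fit_power_law_exponent(drifts, errors)
delta = 1.0 / slope  # drift exponent, since the error scales as D**(1/delta)
print("fitted slope:", slope, "drift exponent:", delta)
```

With measured errors the fit should be restricted to the small-D regime, where the power law is expected to hold.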
In principle, we can classify different unlearnable situations (due to, say, various kinds of
concept drift) by the two exponents $\beta$ and $\delta$. Different algorithms and learning situations
may have the same exponents. We can thus define, in a spirit similar to that in the study
of Critical Phenomena, the so called universality classes of behavior. In this paper we have
seen four classes of behavior $(\beta, \delta)$: the combinations (1/2, 4), (1, 3) for the random drift case
and (1, 4), (1/2, 6) for the deterministic case. In the absence of drift $\beta = 1/2$ for the Hebb
algorithm, and $\beta = 1$ for the symmetric, step and optimal algorithms we have introduced
above. There exist, however, other classes. For example, for the standard Rosenblatt
Perceptron algorithm with fixed learning rate we have $\beta = 1/3$; then, $\delta = 5$ for random drift
and $\delta = 8$ for deterministic drift.

We have observed that, in general, the two exponents are independent. However, if a
simple condition holds, then there exists a relation connecting $\beta$ and $\delta$ for each kind of drift.
For example, as can be checked against the above results,

$$\delta = \frac{1}{\beta} + 2 \quad \text{(random drift)}, \qquad \delta = \frac{2}{\beta} + 2 \quad \text{(deterministic drift)}.$$

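To derive these relations we remember that $e_G \propto \sqrt{1 - \rho^2}$ when $\rho \to 1$, so that we may
write in this limit the learning equation as

$$\frac{de_G}{d\alpha} \approx C D^{m} e_G^{-n} - C_1(D)\, e_G^{n_1} - C_2(D)\, e_G^{n_2} - \cdots$$

where C is a constant, $C_k(D)$ are functions of D and n and $n_k$ are positive numbers. Now,
denote by $C_*(D)$ the first function which survives in the limit $D \to 0$, $C_*(D \to 0) = C_*$.
Then, in the driftless limit, the learning equation is

$$\frac{de_G}{d\alpha} \approx -C_*\, e_G^{n_*},$$

with asymptotic solution

$$e_G(\alpha) \approx \left(C_*(n_* - 1)\,\alpha\right)^{-1/(n_* - 1)} \quad (n_* > 1),$$
$$e_G(\alpha) \propto e^{-C_* \alpha} \quad (n_* = 1).$$

Thus,

$$\beta = \frac{1}{n_* - 1},$$

with $\beta \to \infty$ denoting exponential decay.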
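The link between the power n_* in the driftless equation and the learning exponent can be illustrated by direct integration. The sketch below assumes the reduced equation de/dalpha = -C e^(n_*) with arbitrary illustrative values C = 1 and e(0) = 0.5 (neither is taken from the paper), integrates it with a standard fourth-order Runge-Kutta step, and compares the result at large alpha with the asymptotic form (C (n_* - 1) alpha)^(-1/(n_* - 1)).

```python
def integrate_decay(e0, C, n_star, alpha_max, steps):
    """RK4 integration of de/dalpha = -C * e**n_star starting from e(0) = e0."""
    def rhs(e):
        return -C * e ** n_star

    h = alpha_max / steps
    e = e0
    for _ in range(steps):
        k1 = rhs(e)
        k2 = rhs(e + 0.5 * h * k1)
        k3 = rhs(e + 0.5 * h * k2)
        k4 = rhs(e + h * k3)
        e += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return e


def asymptotic_decay(C, n_star, alpha):
    """Large-alpha power law, valid for n_star > 1."""
    return (C * (n_star - 1) * alpha) ** (-1.0 / (n_star - 1))


C, e0, alpha_max = 1.0, 0.5, 1000.0
ratios = {}
for n_star in (2, 3):
    e_numeric = integrate_decay(e0, C, n_star, alpha_max, 20000)
    ratios[n_star] = e_numeric / asymptotic_decay(C, n_star, alpha_max)
    beta = 1.0 / (n_star - 1)
    print("n_* =", n_star, "beta =", beta, "numeric/asymptotic =", ratios[n_star])
```

A ratio close to one at large alpha indicates that the numerical solution has reached the power-law regime with exponent 1/(n_* - 1).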

In the presence of small drift, we may write $C_i(D) \propto D^{m_i}$. The stationary condition
$de_G/d\alpha = 0$ leads to

$$e_G^{\infty}(D) \sim D^{1/\delta}, \qquad \delta = \frac{n_i + n}{m - m_i}.$$
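The step from the stationary condition to the drift exponent can be written out explicitly. In the following worked balance the proportionality constant $c_i$ is a hypothetical symbol introduced only for illustration, and it is assumed that a single learning term $i$ dominates the balance against the drift term.

```latex
% Stationary balance between the drift term and the dominant learning term i,
% assuming C_i(D) = c_i D^{m_i} with c_i a constant (illustrative notation).
C D^{m} e_G^{-n} = c_i D^{m_i} e_G^{n_i}
\;\Longrightarrow\;
e_G^{\,n_i + n} = \frac{C}{c_i}\, D^{\,m - m_i}
\;\Longrightarrow\;
e_G^{\infty}(D) = \left(\frac{C}{c_i}\right)^{1/(n_i + n)} D^{(m - m_i)/(n_i + n)},
\qquad
\delta = \frac{n_i + n}{m - m_i}.
```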

This shows that, in principle, the two exponents are independent. However, if it happens that
the first surviving function is $C_*(D) = C_i(D)$ (which seems to be a very common situation),
then $n_* = n_i$ and $m_i = 0$, so that

$$\delta = \frac{1}{m}\left(\frac{1}{\beta} + (1 + n)\right).$$

The relations given by Eq. (^) follow from the fact that $n = 1$, $m = 1$ for random drift
and $n = 0$ and $m = 1/2$ for deterministic drift. Other drift scenarios may define other
universality classes.
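The exponent relation can be checked against the classes quoted in the text with exact rational arithmetic. This is a small sketch under the stated assumptions: the relation delta = (1/m)(1/beta + 1 + n), with (n, m) = (1, 1) for random drift and (0, 1/2) for deterministic drift; the names are hypothetical.

```python
from fractions import Fraction

# (n, m) pairs quoted in the text for each drift scenario.
DRIFT_PARAMS = {
    "random": (Fraction(1), Fraction(1)),
    "deterministic": (Fraction(0), Fraction(1, 2)),
}


def drift_exponent(beta, drift):
    """delta = (1/m) * (1/beta + 1 + n) for the given drift scenario."""
    n, m = DRIFT_PARAMS[drift]
    return (1 / Fraction(beta) + 1 + n) / m


for beta in (Fraction(1, 2), Fraction(1), Fraction(1, 3)):
    print(
        "beta =", beta,
        "random:", drift_exponent(beta, "random"),
        "deterministic:", drift_exponent(beta, "deterministic"),
    )
```

The three values of beta correspond to the Hebb algorithm, the symmetric, step and optimal algorithms, and the fixed-rate Rosenblatt perceptron, respectively.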
In the case where $C_* \neq C_i$, we can only conclude that

$$\delta > \frac{1}{m}\left(\frac{1}{\beta} + (1 + n)\right).$$

It is important to note that an exponential decay of the error ($\beta = \infty$) leads to the limiting value
$\delta = 2$ both for deterministic and for random drift. It is known that the error cannot decay faster
than exponentially in these learning problems.



IV. PRACTICAL CONSIDERATIONS 



The most important question that can be raised in the implementation of the variational 
ideas as a guide to construct algorithms is how to measure the several unavailable quan- 
tities that go into the construction of the modulation function. The problem of inferring 
the example distribution will not be considered and only a simple method to measure the 
student-teacher overlap p will be presented. This is done by adding a 'module' to the per- 
ceptron in order to estimate online the generalization error, as studied in Algorithms 
that rely on this kind of module are quite robust with respect to changes in the distribution 
of examples and even to lack of statistical independence [20]. Consider an online estimator
(a 'running average') which uses the instantaneous error $e^{\mu} = (1 - \sigma_B \sigma_J)/2$ to update the
current estimate of the generalization error:

$$\hat{e}_G(\mu + 1) = \left(1 - \frac{\omega}{N}\right) \hat{e}_G(\mu) + \frac{\omega}{N}\, e^{\mu}.$$
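A running-average estimator of this kind takes only a few lines. The sketch below is illustrative rather than the authors' implementation: it assumes the update estimate <- (1 - omega/N) estimate + (omega/N) e, where the omega/N scaling of the forgetting rate and the initial value 0.5 are assumptions, and the class name is hypothetical.

```python
class RunningErrorEstimator:
    """Exponentially forgetting estimate of the generalization error."""

    def __init__(self, omega, N, initial=0.5):
        self.rate = omega / N      # assumed forgetting rate per example
        self.estimate = initial    # assumed starting point: chance level

    def update(self, sigma_teacher, sigma_student):
        # Instantaneous error: 0 if the outputs agree, 1 if they disagree.
        instantaneous = (1 - sigma_teacher * sigma_student) / 2
        self.estimate = (1 - self.rate) * self.estimate + self.rate * instantaneous
        return self.estimate


# Deterministic toy stream: the student is wrong on every fourth example,
# so the true error rate is 0.25.
estimator = RunningErrorEstimator(omega=2, N=1000)
for mu in range(10000):
    sigma_student = -1 if mu % 4 == 0 else 1
    estimator.update(1, sigma_student)

print("estimated error:", estimator.estimate)
```

A larger omega shortens the memory, which makes the estimate react faster to changes at the cost of larger fluctuations.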

This estimator incorporates exponential memory loss through the $\omega$ parameter. In the
perceptron, due to the factor $\lambda = \tan(\pi e_G)$ that appears in the modulation function,
fluctuations around $e_G \approx 1/2$ may lead to spurious divergences. Therefore it is natural to consider
the truncated Taylor expansion

$$\lambda_k = \tan^{(k)}(\pi e_G) = \pi e_G + \frac{1}{3}(\pi e_G)^3 + \frac{2}{15}(\pi e_G)^5 + \cdots + c_k (\pi e_G)^k.$$
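The effect of the truncation is easy to see numerically. The sketch below uses the standard Maclaurin coefficients of the tangent up to seventh order; the function names are hypothetical and the default k = 3 follows the choice quoted in the text. For small errors the truncated series is close to the exact tangent, while at e_G = 1/2, where the tangent diverges, it stays finite.

```python
import math

# Maclaurin coefficients of tan(x): x + x^3/3 + 2x^5/15 + 17x^7/315 + ...
TAN_COEFFS = {1: 1.0, 3: 1.0 / 3.0, 5: 2.0 / 15.0, 7: 17.0 / 315.0}


def truncated_tan(x, k):
    """Taylor polynomial of tan(x) keeping powers up to x**k (k <= 7)."""
    return sum(c * x ** p for p, c in TAN_COEFFS.items() if p <= k)


def lambda_k(e_G, k=3):
    """Truncated replacement for tan(pi * e_G)."""
    return truncated_tan(math.pi * e_G, k)


for e_G in (0.05, 0.1, 0.3, 0.5):
    exact = math.tan(math.pi * e_G)
    print("e_G =", e_G, "truncated:", lambda_k(e_G), "exact tan:", exact)
```

The printed exact value at e_G = 0.5 is a very large floating-point number rather than infinity, which is the kind of spurious blow-up the truncation is meant to avoid.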

Then, the modulation function for an adaptive algorithm inspired by the noiseless optimal
algorithm is

W(\k, A M ) = L_ A " A exp(-4|) . 
In Fig. 7 we present the results of applying this algorithm to a problem where the drift
itself is non-stationary. We have dubbed this non-stationarity drift acceleration. The
algorithm is quite insensitive to the particular type of drift acceleration, and as an illustration
we chose a drift given by $D = D_0 \sin^2(2\pi\nu t)$. The adaptive algorithm makes no use of
this knowledge. There has been no attempt at optimizing the estimator itself, but a
reasonable and robust choice is $\omega = 2$ and $k = 3$. Simulations were done for N = 1000, a size
regime where, for all practical purposes, the central limit theorem holds. Note that the Hebb
algorithm is not able to keep track of the rule since it has no internal forgetting mechanism.

We have not studied the mixed case of drift in the presence of noise. The nature of the noise
process corrupting the data is essential in determining the asymptotic learning exponent
($\beta$). While multiplicative (flip) noise does not alter $\beta$ for the optimized algorithms, additive
(weight) noise does. This extension deserves a separate study. See [^| for the behavior of the
optimized algorithm and noise level estimation in the presence of noise acceleration in the
absence of drift; see also Jl^], where it is shown that learning is possible even in the mixed
drift-noise case.



V. THE WISCONSIN TEST FOR PERCEPTRONS: PIECEWISE CONSTANT RULES 



How do the algorithms studied in the previous sections perform in the case of abrupt
changes (piecewise constant rules)? The interest is in determining how the optimal
algorithms fare in a task for which they were not optimized. The Wisconsin test for
perceptrons (WTP) to be studied here was inspired by what is called the Wisconsin test (WT),
which is used in the diagnosis of pre-frontal lobe (PFL) syndrome in human patients and which
will now be described very briefly (for details see e.g. [p7pl|).
FIG. 8. Simulations for N = 500 in single runs. Bottom line: lower bound given by the Optimal algorithm with
the true values of $\lambda$. Symbols: Optimal algorithm (white squares), Step algorithm (black circles), Symmetric Weight
algorithm (white circles), Annealed Hebb (white triangles), all with the $\omega = 2$ and $k = 3$ estimator, and Hebb (black
triangles).



Consider a deck of cards, each of which has a set of pictures. The cards can be arranged in
several different manners into two categories. The different possible classifications can be
made according, e.g., to color (black or red pictures), to parity (even or odd number of figures
in the picture), to sharpness (figures can be round or pointed) etc. The examiner chooses a
rule and a patient is shown a sequence of cards and asked to classify them. The information
on whether the patient's presumed classification is correct or not is made available before the
next presentation. After a few trials (5-10) normal subjects are able to infer the desired
rule. PFL patients are reported to infer the rule correctly after as few as 15 trials. Now
a new rule is chosen at random by the examiner but the patient is not informed about the
change. Normal subjects are quick to pick up the change and after a small number of trials
(5-10) are again able to correctly classify the cards. PFL patients are reported to persevere
in the old rule and after as many as 60 trials still insist on the old classification rule.

Our WTP is designed as a direct implementation of these ideas, by considering learning
of a piecewise constant teacher, without resetting the couplings to tabula rasa, i.e., without
letting the patient know that the rule has changed.

Fig. 8 shows the results of simulations with the adaptive algorithm of (fl5|). The rule
is constant up to $\alpha = 10$; it then suddenly jumps to another, uncorrelated vector and stays
again unchanged until $\alpha = 20$, and so on. The most striking feature is that the perceptron
with the pure Hebbian algorithm works quite efficiently for the first rule but perseveres in that
state and adapts poorly to a change. It cannot detect performance degradation and is not
surprised by the errors. The reason is that the scale of the weight changes is the
same independently of the length of the J vector. The other algorithms are able to adapt to
the new conditions inasmuch as they incorporate the estimate of the performance of the
student.
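The perseverance of the pure Hebbian student can be reproduced with a short simulation. The sketch below is not the simulation behind the figure: it assumes the plain Hebb update J <- J + sigma_B S / sqrt(N) with Gaussian inputs, measures the generalization error as arccos(rho)/pi, and uses an illustrative size N = 200 with two uncorrelated rules, each held for alpha = 10. All function names are hypothetical.

```python
import math
import random


def dot(u, v):
    """Plain inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(u, v))


def random_unit_vector(N, rng):
    """Isotropic random teacher vector of unit length."""
    v = [rng.gauss(0.0, 1.0) for _ in range(N)]
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def generalization_error(J, B):
    """e_G = arccos(rho) / pi for isotropic inputs."""
    rho = dot(J, B) / math.sqrt(dot(J, J) * dot(B, B))
    return math.acos(max(-1.0, min(1.0, rho))) / math.pi


def hebb_wisconsin_test(N=200, alpha_per_rule=10, n_rules=2, seed=0):
    """Error at the end of each rule; couplings are never reset."""
    rng = random.Random(seed)
    J = [0.0] * N
    errors = []
    for _ in range(n_rules):
        B = random_unit_vector(N, rng)          # new, uncorrelated rule
        for _ in range(alpha_per_rule * N):
            S = [rng.gauss(0.0, 1.0) for _ in range(N)]
            sigma_B = 1.0 if dot(B, S) >= 0 else -1.0
            scale = sigma_B / math.sqrt(N)
            J = [Ji + scale * Si for Ji, Si in zip(J, S)]  # plain Hebb step
        errors.append(generalization_error(J, B))
    return errors


errors = hebb_wisconsin_test()
print("error at end of rule 1:", errors[0])
print("error at end of rule 2:", errors[1])
```

Because every example adds a step of the same size, the contribution of the first rule is never forgotten, and the error at the end of the second rule is expected to remain noticeably higher than at the end of the first.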
VI. CONCLUSIONS



The necessary ingredients for the online tracking of drifting concepts, adaptation to non- 
stationary noise etc., emerge naturally and in an integrated way in the optimized algorithms. 
These ingredients have been theoretically derived rather than heuristically introduced. Many 
of the current ideas in machine learning of changing concepts can be viewed as playing a 
role similar to the ideal features discussed here for the perceptron. Among the important 
ideas arising from the variational approach are: 

Learning algorithms from first principles: For each one of these simple learning scenarios an ideal and 
optimal learning algorithm can be found from first principles. These optimized algorithms do not have arbitrary 
parameters like learning rates, acceptance thresholds, parameters of learning schedules, forgetting factors etc. 
Instead, they have a set of features which play similar roles to those heuristic procedures. The exact form of 
these features may suggest new mechanisms for better learning. 

Learn to learn in changing environments: The Optimal modulation function W indeed represents a 
parametric family of functions, the (non-free) parameters being the same as those present in the probability 
distribution of the learning problem: J, $\rho$, $\chi$, etc. The modulation function changes during learning, that is,
the algorithm moves in this parametric space during the learning process. The student "learns to learn" by 
online estimation of the necessary parameters of its modulation function. 

Robustness of optimized algorithms: Historically, a multitude of learning algorithms has been suggested
for the perceptron: Rosenblatt's Perceptron, Adaline, Hebb, Adatron, Thermal Perceptron, OLGA, etc. From
the variational perspective, these practical algorithms can be viewed as more or less reliable approximations of
the ideal ones in the TL. For example, simple Hebb corresponds to the Optimal algorithm in the limit $\rho \to 0$;
the Adatron (Relaxation) algorithm is related to the limit $\rho \to 1$; OLGA [14] and the Thermal perceptron include
an acceptance threshold which mimics the optimal algorithm in the presence of multiplicative noise \-. Thus,
although the optimal algorithms are derived for very specific distributions of examples, this does not mean that
they are fragile or non-robust when applied in other environments. They are indeed very robust [20], at least
for the environments in which the standard algorithms work, since these practical algorithms are 'particular
cases' of a more general modulation function. But since new learning situations (new types of noise, drifting
processes, general non-stationarity etc.) can be theoretically examined from the variational viewpoint, it is
possible that new features emerge, and that these suggest new practical ideas for more robust and efficient
learning.

Emergence of 'cognitive' modules: Have the variational ideas any relevance to 'Biological Machine Learn- 
ing'? Probably not for the biological structures, which are produced by opportunistic 'evolution bricolage', 
but perhaps they might apply in understanding biological cognitive functions. The variational approach brings 
forth a suggestion that, even if not new, acquires a more concrete form due to the transparent nature of the 
simple models studied: optimization of the learning ability leads to the emergence of 'cognitive functional
modules', here defined as components of the modulation function and accessory estimators of relevant quantities of
the probability distribution related to the learning situation. A tentative list of such estimators suggested by
the variational approach may be: a) a mismatch (surprise) module for detection of discrepant examples; b) 
an emotional/attentional module for providing differential memory weight for these discrepant examples; c) 
'constructivist' filters which accommodate or downplay the highly discrepant data; d) noise level estimators for 
tuning these filters; e) a working memory system for online estimation of current performance which enables 
detection of environmental changes. In conclusion, the variational approach suggests that the necessity of 
certain cognitive functions may be related to statistical inference principles already present in simple learning 
machines. 

Extensions: All the results here presented have been obtained under a rather severe set of restrictions from 
a practical point of view. The main points concern the TL; noise, order parameter and example distribution 
estimation; larger architecture complexity. At present we don't know how to handle finite size effects. That the 
parameter estimation problem is probably easier than the others is suggested by the robustness found in 
The extension of the variational program to architectures experimentally more relevant, especially to include
hidden nodes and soft transfer functions ^,^|, has shown that this is a difficult task. The effects that drift
may have are not known, but it could even induce faster breaking of the permutation symmetry among the
hidden nodes, thus affecting the plateau structure.

Important remaining questions concern whether the variational approach can be successfully
applied to other learning models (radial basis functions, mixture models etc.). The
answers will help in determining the difference between universal and particular features of
the learning systems.